Comparing Data Streams via Sketching
Authors
Emmanuelle Anceaume (CNRS UMR 6074 IRISA, CIDRE), Yann Busnel (LINA, Université de Nantes, ATLAS-GDD)
Abstract
We consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately in order to quickly detect any deviation from nominal behavior. We present a new metric, the Sketch ⋆-metric, which defines a distance between updatable summaries (or sketches) of large data streams. An important feature of the Sketch ⋆-metric is that, given a measure on the entire initial data streams, it preserves the axioms of that measure on the sketch (such as non-negativity, identity, symmetry, and the triangle inequality, as well as specific properties of f-divergences or Bregman divergences). Extensive experiments conducted on both synthetic traces and real data sets validate the robustness and accuracy of the Sketch ⋆-metric.
Keywords: Data stream; metric; randomized approximation algorithm.
French title: Sketch ⋆-métrique : Comparaison de flots de données basée sur des résumés ("sketch")
© IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France, www.irisa.fr. HAL identifier: hal-00764772, version 1, 13 Dec 2012.
Similar resources
Sketch ⋆-metric: Comparing Data Streams via Sketching (Research Report)
In this paper, we consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately in order to quickly detect any deviation from nominal behavior. We present a new metric, the S...
Sketch ⋆-metric: Comparing Data Streams via Sketching
In this paper, we consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately in order to quickly detect any deviation from nominal behavior. We present a new metric, the S...
Corrections to “LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams”
In this article, we describe the corrections to our paper “LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams” published at IEEE INFOCOM 2014. We also clarify the complexity issue raised by some readers.
Improved Sketching of Hamming Distance with Error Correcting
We address the problem of sketching the Hamming distance of data streams. We present a new sketching technique, fixable sketches, and show that such a sketch not only reduces the sketch size but also recovers the differences between the streams. Our contribution: for two streams with Hamming distance bounded by k, we show a sketch of size O(k log n) with O(log n) processing time ...
Algorithmic Techniques for Processing Data Streams
We give a survey of some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more advanced procedures that build on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams. 1998 ACM Subject Classification F...